In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into customers of the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like to take in analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable will be a blog post reporting your findings.
According to Britannica, a mail-order business is a
method of merchandising in which the seller’s offer is made through mass mailing of a circular or catalog or through an advertisement placed in a newspaper or magazine and in which the buyer places an order by mail.
# update joblib
!pip install --ignore-installed joblib==0.14
# update sklearn
!pip install --upgrade scikit-learn
# install imblearn
!pip install imblearn
# import libraries here; add more as necessary
import pickle
import warnings
import gc
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from collections import defaultdict
from datetime import datetime
from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.base import clone
from sklearn.cluster import MiniBatchKMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile, RFE, VarianceThreshold, chi2
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from tqdm import tqdm
# stop warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# magic word for producing visualizations in notebook
%matplotlib inline
read_pickle = False
seed = 42
pd.set_option('max_columns', 120)
pd.set_option('max_colwidth', 5000)
plt.rcParams['figure.dpi'] = 120
def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            # Print current column type
            print("******************************")
            print("Column: ", col)
            print("dtype before: ", props[col].dtype)
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                print("Filled NA with", mx + 1)
                NAlist.append(col)
                props[col].fillna(mx + 1, inplace=True)
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                # Refill with NA
                if col in NAlist:
                    print(f"Replaced {mx + 1} with NA")
                    props[col].replace(mx + 1, np.nan, inplace=True)
                    props.loc[:, col] = props[col].astype(np.float16)
                elif mn >= 0:
                    if mx < 255:
                        props.loc[:, col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props.loc[:, col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props.loc[:, col] = props[col].astype(np.uint32)
                    else:
                        props.loc[:, col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props.loc[:, col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props.loc[:, col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props.loc[:, col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props.loc[:, col] = props[col].astype(np.int64)
            # Make float datatypes 16 bit
            else:
                props.loc[:, col] = props[col].astype(np.float16)
            # Print new column type
            print("dtype after: ", props[col].dtype)
            print("******************************")
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist
def get_null_prop(df, axis=0, plot=True):
    """Calculates null proportions in dataframe columns or rows."""
    # calculate null proportion of each column or row
    null_prop = (df.isnull().sum(axis=axis) / df.shape[axis]).sort_values()
    if plot:
        null_prop.hist(edgecolor='black', linewidth=1.2)
    return null_prop
def replace_unknown_with_null(df):
    """
    Replace unknown values encoded as 0 and -1 with null values.
    """
    # Read the Values sheet, which has information about unknown encodings
    dias_vals = pd.read_excel("DIAS Attributes - Values 2017.xlsx")
    # Find unknown values for features
    feat_unknown_vals = dias_vals.query("Meaning == 'unknown'")
    # Replace unknown values for each feature in the dataframe
    for feat in feat_unknown_vals.itertuples():
        # Check if feature is among the dataframe features
        if feat.Attribute in df.columns:
            # if there is more than one unknown value
            if ',' in str(feat.Value):
                # loop over unknown values
                for val in str(feat.Value).split(','):
                    # replace unknown value with null
                    df[feat.Attribute].replace(eval(val), np.nan, inplace=True)
            else:
                # replace unknown value with null
                df[feat.Attribute].replace(feat.Value, np.nan, inplace=True)
    # Replace other unknown values that aren't in the Values sheet
    df["ALTER_HH"].replace(0, np.nan, inplace=True)
    df["KOMBIALTER"].replace(0, np.nan, inplace=True)
    # Replace 0 and -1 with null in non-binary features missing from the Values sheet
    unknown_feats = df[list(set(df.columns).difference(dias_vals.Attribute))].copy()
    for feat in unknown_feats:
        if df[feat].nunique() > 2:
            df[feat].replace(-1, np.nan, inplace=True)
            df[feat].replace(0, np.nan, inplace=True)
    return df
def compare_features(dfs=[], labels=[]):
    """Plots all features of the passed dataframes for comparison."""
    # parameters of subplot
    nrows = 1
    ncols = 2
    # number of features
    nfeats = feat_cat_df.shape[0]
    # colors
    colors = "bgrcmy"
    # loop over ncols features at a time to plot them in a row
    for i in range(0, nfeats, ncols):
        # make subplots
        fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 3))
        # loop over each feature in the row
        for j, feat_idx in enumerate(range(i, i + ncols)):
            # feature name and category
            feat = feat_cat_df.iloc[feat_idx].feature
            cat = feat_cat_df.iloc[feat_idx].category
            # dict with label as key and df feature as value
            dfs_feat = {}
            for df, label in zip(dfs, labels):
                dfs_feat[label] = df.loc[:, feat]
            # plot using histogram if unique values exceed 16
            if dfs_feat[label].nunique() > 16:
                for (label, df_feat), color in zip(dfs_feat.items(), colors):
                    df_feat.hist(bins=20, color=color, density=True, alpha=0.4, ax=axes[j], label=label)
                axes[j].legend()
            # plot using bar chart
            else:
                columns = []
                feats_p = None
                # concatenate all df feature value counts in one dataframe and plot bar plot
                for label, df_feat in dfs_feat.items():
                    columns.append(label)
                    feat_p = df_feat.value_counts() / df_feat.shape[0]
                    feats_p = pd.concat([feats_p, feat_p], axis=1)
                feats_p.columns = columns
                feats_p.sort_index().plot(kind="bar", ax=axes[j])
            # assign unknown feature description
            feat_desc = 'Unknown'
            # if a feature description exists, include it instead of unknown
            if feat in known_feats:
                feat_desc = dias_atts[dias_atts["Attribute"] == feat]["Description"].item()
            # set title as feature, category and description
            axes[j].set_title(f"{feat}\n{cat}\n{feat_desc}")
def clean_dataset(df, p_row=0.5, p_col=0.5, drop_uncertain=True, keep_features=[]):
    """
    Clean dataset using insights gained during EDA.
    inputs
    1. df (pandas dataframe)
    2. p_row (float) - maximum proportion of null values allowed in rows
    3. p_col (float) - maximum proportion of null values allowed in columns
    4. drop_uncertain (bool) - drop features we are uncertain about
    5. keep_features (list) - features to keep even if they exceed p_col
    """
    # Make a new copy of the dataframe
    clean_df = df.copy()
    # Clean columns with mixed dtypes
    clean_df["CAMEO_DEUG_2015"].replace('X', np.nan, inplace=True)
    clean_df["CAMEO_INTL_2015"].replace('XX', np.nan, inplace=True)
    # Replace unknown values with missing
    clean_df = replace_unknown_with_null(clean_df)
    # Drop rows with more than p_row missing values
    min_count = int((1 - p_row) * clean_df.shape[1] + 1)
    clean_df.dropna(axis=0, thresh=min_count, inplace=True)
    # Drop duplicated rows
    clean_df.drop_duplicates(inplace=True)
    # Drop GEBURTSJAHR (year of birth) that has 44% missing values
    clean_df.drop('GEBURTSJAHR', axis=1, inplace=True)
    # Drop LNR which is a unique identifier
    clean_df.drop('LNR', axis=1, inplace=True)
    # Drop CAMEO_DEU_2015 as it's not suitable for PCA
    clean_df.drop('CAMEO_DEU_2015', axis=1, inplace=True)
    # Drop features with more than p_col missing values
    features_missing_p = clean_df.isna().sum() / clean_df.shape[0]
    features_above_p_thresh = clean_df.columns[features_missing_p > p_col]
    features_to_keep = ["ALTER_KIND1", "ALTER_KIND2", "ALTER_KIND3", "ALTER_KIND4", "ANZ_KINDER"] + keep_features
    features_to_remove = [feat for feat in features_above_p_thresh if feat not in features_to_keep]
    clean_df.drop(features_to_remove, axis=1, inplace=True)
    # Drop uncertain features
    if drop_uncertain:
        uncertain_features = ["GEMEINDETYP"]
        clean_df.drop(uncertain_features, axis=1, inplace=True)
    # Feature Engineering
    # One Hot Encoding D19_LETZTER_KAUF_BRANCHE
    dummies = pd.get_dummies(clean_df["D19_LETZTER_KAUF_BRANCHE"], prefix="D19_LETZTER_KAUF_BRANCHE")
    clean_df = pd.concat([clean_df, dummies], axis=1)
    clean_df.drop("D19_LETZTER_KAUF_BRANCHE", axis=1, inplace=True)
    # Calculate year difference in MIN_GEBAEUDEJAHR
    clean_df["MIN_GEBAEUDEJAHR_ENG"] = (2017 - clean_df["MIN_GEBAEUDEJAHR"])
    clean_df.drop("MIN_GEBAEUDEJAHR", axis=1, inplace=True)
    # Calculate days difference in EINGEFUEGT_AM
    current = datetime.strptime("2017-01-01", "%Y-%m-%d")
    clean_df["EINGEFUEGT_AM_DAY"] = (current - pd.to_datetime(clean_df["EINGEFUEGT_AM"])).dt.days
    clean_df.drop("EINGEFUEGT_AM", axis=1, inplace=True)
    # Replace null values in ALTER_KIND and ANZ_KINDER with 0 to avoid imputation
    for feat in clean_df.columns[clean_df.columns.str.startswith("ALTER_KIND")]:
        clean_df[feat].replace(np.nan, 0, inplace=True)
    clean_df["ANZ_KINDER"].replace(np.nan, 0, inplace=True)
    # Convert OST_WEST_KZ to binary labels
    clean_df["OST_WEST_KZ"] = (clean_df["OST_WEST_KZ"] == "W").astype(np.uint8)
    # Convert CAMEO_INTL_2015 and CAMEO_DEUG_2015 to float32
    CAMEO_feats = ["CAMEO_INTL_2015", "CAMEO_DEUG_2015"]
    clean_df[CAMEO_feats] = clean_df[CAMEO_feats].astype(np.float32)
    # Convert float16 features to float32 to enable arithmetic operations
    float_feats = clean_df.select_dtypes(np.float16).columns
    clean_df[float_feats] = clean_df[float_feats].astype(np.float32)
    return clean_df
def batch_fit_scaler(scaler, data, n_batches=100):
    """
    Fit a scaler to data through n batches.
    Input:
        scaler (scikit-learn Scaler object)
        data (numpy array or pandas dataframe)
        n_batches (int) - number of batches for fitting the scaler
    Output:
        Fitted Scaler
    """
    for X_batch in tqdm(np.array_split(data, n_batches), total=n_batches):
        scaler.partial_fit(X_batch)
    return scaler
def batch_fit_pca(pca, scaler, data, n_batches=100):
    """
    Fit an Incremental PCA to scaled data through n batches.
    Input:
        pca (scikit-learn IncrementalPCA object)
        scaler (scikit-learn Scaler object)
        data (numpy array or pandas dataframe)
        n_batches (int) - number of batches for fitting the transformer
    Output:
        Fitted IncrementalPCA
    """
    for X_batch in tqdm(np.array_split(data, n_batches), total=n_batches):
        scaled_X_batch = scaler.transform(X_batch)
        pca.partial_fit(scaled_X_batch)
    return pca
def batch_transform_pca(pca, scaler, data, n_batches=100):
    """
    Transform large data using a fitted pca.
    Input:
        pca (Fitted IncrementalPCA)
        scaler (Fitted Scaler)
        data (numpy array or pandas dataframe)
        n_batches (int) - number of batches for transforming the data
    Output:
        Transformed data
    """
    pca_data = None
    for X_batch in tqdm(np.array_split(data, n_batches), total=n_batches):
        scaled_X_batch = scaler.transform(X_batch)
        pca_X_batch = pca.transform(scaled_X_batch)
        if pca_data is None:
            pca_data = pca_X_batch
        else:
            pca_data = np.vstack([pca_data, pca_X_batch])
    return pca_data
def plot_multi_bar(feat, dfs_feat, ax, color):
    columns = []
    feats_p = None
    # concatenate all df feature value counts in one dataframe and plot bar plot
    for label, df_feat in dfs_feat.items():
        columns.append(label)
        feat_p = df_feat.value_counts() / df_feat.shape[0]
        feats_p = pd.concat([feats_p, feat_p], axis=1)
    feats_p.columns = columns
    # map value ticks to their meanings
    vals_dict = get_feat_val_meaning(feat)
    feats_p.sort_index(inplace=True)
    feats_p.index = feats_p.index.map(vals_dict)
    feats_p.plot(kind="bar", color=color, ax=ax)
def get_feat_val_meaning(feat):
    # filter out the feature
    feat_vals = dias_vals[dias_vals["Attribute"] == feat]
    # make a series of Value and Meaning with Value as index
    feat_vals_meaning = feat_vals[["Value", "Meaning"]].set_index("Value")
    # convert series to dict
    vals_dict = feat_vals_meaning.to_dict()["Meaning"]
    return vals_dict
def compare_feature(feat, title):
    fig, axes = plt.subplots(1, 2, figsize=(16, 4))
    # plot feature for customers and non-customers
    dfs_feat = {"Customers": customers[feat],
                "Non-Customers": non_customers[feat]}
    plot_multi_bar(feat, dfs_feat, axes[0], ["tab:blue", "grey"])
    # set titles of the subplots
    axes[0].set_title(f"{title} between Customers and Non-Customers")
    axes[1].set_title(f"{title} between Customer Clusters")
    # plot feature for customers' clusters
    dfs_feat = {"Cluster 0": cluster_0[feat],
                "Cluster 4": cluster_4[feat],
                "Cluster 6": cluster_6[feat]}
    plot_multi_bar(feat, dfs_feat, axes[1], ["#878cad", "#4b527c", "#323752"])
def model_validation(model, X, y, fold_results=False, final_results=True, return_scores=False, return_preds=False, plot_confusion=False, metric=roc_auc_score):
    """Evaluate model using stratified k-fold cross-validation and sklearn's classification report."""
    # Instantiate StratifiedKFold
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    # Make empty y_pred to fill with predictions of each fold
    y_pred = np.zeros(y.shape)
    # Initialize empty list for scores
    scores = []
    for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        # Make copy of model
        model_clone = clone(model)
        # Fit model
        model_clone.fit(X_train, y_train)
        # Predict fold y_test and add in y_pred
        fold_pred = model_clone.predict(X_test)
        y_pred[test_idx] += fold_pred
        # Print classification report
        if fold_results:
            print("Fold:", i + 1)
            print(classification_report(y_test, fold_pred))
        # Calculate metric scores (default is roc_auc_score)
        scores.append(metric(y_test, fold_pred, average="macro"))
    # Print final classification report
    if final_results:
        print("Final Report:")
        print(classification_report(y, y_pred))
        print("Metric Score:", np.mean(scores))
    # Return metric scores if requested
    if return_scores:
        return scores
    # Plot confusion matrix if specified
    if plot_confusion:
        ax = plt.subplot()
        cmat = confusion_matrix(y, y_pred)
        sns.heatmap(cmat, annot=True, fmt="g", ax=ax)
        ax.set_xlabel('Predicted labels')
        ax.set_ylabel('True labels')
        ax.set_title('Confusion Matrix')
        ax.xaxis.set_ticklabels(['No Response', 'Response'])
        ax.yaxis.set_ticklabels(['No Response', 'Response'])
def evaluate_models(models_dict, X, y, scoring="f1_macro"):
    """Evaluate several models using sklearn's cross_val_score."""
    model_scores = {}
    for name, model in models_dict.items():
        skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
        scores = cross_val_score(model, X, y, cv=skf, scoring=scoring)
        model_scores[name] = scores
        print("Model:%s, Score:%.3f (+/- %.3f)" % (name, np.mean(scores), np.std(scores)))
    return model_scores
There are four data files associated with this project:
Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).
Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.
The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.
Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.
In the cell below, we've provided some initial code to load in the first two datasets. Note that all of the .csv data files in this project are semicolon (;) delimited, so an additional argument has been included in the read_csv() call to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.
You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
# csv data path
data_path = '../../data/Term2/capstone/arvato_data/'
# csv files
azdias_csv = 'Udacity_AZDIAS_052018.csv'
customers_csv = 'Udacity_CUSTOMERS_052018.csv'
# csv file paths
azdias_csv_path = os.path.join(data_path, azdias_csv)
customers_csv_path = os.path.join(data_path, customers_csv)
# pickle files
azdias_pickle = 'Udacity_AZDIAS_052018.csv.pandas.pickle'
customers_pickle = 'Udacity_CUSTOMERS_052018.csv.pandas.pickle'
# load pickled datasets if they exist
if azdias_pickle in os.listdir() and customers_pickle in os.listdir():
    print("Loading AZDIAS pickle...")
    azdias = pd.read_pickle(azdias_pickle)
    print("Loading CUSTOMERS pickle...")
    customers = pd.read_pickle(customers_pickle)
    read_pickle = True
# else load csv and save pickles later
else:
    print("Loading AZDIAS csv...")
    azdias = pd.read_csv(azdias_csv_path, sep=';')
    print("Loading CUSTOMERS csv...")
    customers = pd.read_csv(customers_csv_path, sep=';')
print("Loading Attributes Sheet...")
dias_atts = pd.read_excel("DIAS Information Levels - Attributes 2017.xlsx")
print("Loading Values Sheet...")
dias_vals = pd.read_excel("DIAS Attributes - Values 2017.xlsx")
print("Done.")
azdias.iloc[:, [18, 19]].head()
customers.iloc[:, [18, 19]].head()
azdias['CAMEO_DEUG_2015'].value_counts()
customers['CAMEO_DEUG_2015'].value_counts()
azdias['CAMEO_INTL_2015'].value_counts()
customers['CAMEO_INTL_2015'].value_counts()
There are 373 rows in AZDIAS that have 'X' in CAMEO_DEUG_2015 and 'XX' in CAMEO_INTL_2015, while there are 126 such rows in CUSTOMERS.
According to TransUnion UK, CAMEO is a consumer segmentation system linking address information to demographic, lifestyle and socio-economic insight. It is essentially data obtained from a different information source, which is why rows that have a missing value in one of these features also have it missing in the other.
We can fill these rows with NaNs for now and continue exploring the dataset.
feats = ["CAMEO_INTL_2015", "CAMEO_DEUG_2015"]
for feat in feats:
    # 'X' (CAMEO_DEUG_2015) and 'XX' (CAMEO_INTL_2015) both encode missing values
    azdias[feat].replace({"X": np.nan, "XX": np.nan}, inplace=True)
    customers[feat].replace({"X": np.nan, "XX": np.nan}, inplace=True)
# looking into the general population dataset
print('Shape:', azdias.shape)
azdias.head()
# looking into the customers dataset
print('Shape:', customers.shape)
customers.head()
# general population dataset info
azdias.info()
# customers dataset info
customers.info()
if not read_pickle:
    azdias_p, azdias_na = reduce_mem_usage(azdias)
if not read_pickle:
    customers_p, customers_na = reduce_mem_usage(customers)
# save datasets in pickle format for later usage
if not read_pickle:
    pd.to_pickle(azdias_p, azdias_pickle)
    pd.to_pickle(customers_p, customers_pickle)
azdias_feats = set(azdias.columns)
customers_feats = set(customers.columns)
print("Features in customers not in azdias:")
print(customers_feats.difference(azdias_feats))
dias_atts.head()
dias_vals.head(10)
print("Number of features in Values:", dias_vals.Attribute.dropna().nunique())
print("Number of features in Attributes:", dias_atts.Attribute.dropna().nunique())
dias_vals_feats = set(dias_vals.Attribute.dropna())
dias_atts_feats = set(dias_atts.Attribute.dropna())
print("Features in Values not in Attributes:")
for feat in dias_vals_feats.difference(dias_atts_feats):
    print(feat)
print("\nFeatures in Attributes not in Values:")
for feat in dias_atts_feats.difference(dias_vals_feats):
    print(feat)
# set of features that have wrong format
wrong_feats = dias_atts_feats.difference(dias_vals_feats)
dias_atts.query("Attribute in @wrong_feats")
print("Features in Values not in azdias:")
for feat in dias_vals_feats.difference(azdias_feats):
    print(feat)
print("\nFeatures in Attributes not in azdias:")
for feat in dias_atts_feats.difference(azdias_feats):
    print(feat)
vals_in_azdias = dias_vals_feats.intersection(azdias_feats)
atts_in_azdias = dias_atts_feats.intersection(azdias_feats)
print("Number of features in Azdias:", azdias.shape[1])
print("Number of features shared between Azdias and Values:", len(vals_in_azdias))
print("Number of features shared between Azdias and Attributes:", len(atts_in_azdias))
print("Features in Azdias not in Values or Attributes:")
lost_feats = sorted(list(set(azdias_feats).difference(atts_in_azdias)))
for feat in lost_feats:
    print(feat)
print("Features in Values not in Attributes:")
for feat in vals_in_azdias.difference(atts_in_azdias):
    print(feat)
azdias_null_cols = get_null_prop(azdias, plot=False)
customers_null_cols = get_null_prop(customers, plot=False)
azdias_null_cols.hist(bins=10, alpha=0.7, label='AZDIAS')
customers_null_cols.hist(bins=10, alpha=0.7, label='CUSTOMERS')
plt.title("Distribution of Null Values in Columns");
plt.legend()
plt.tight_layout()
plt.savefig("null_columns.png");
azdias_null_rows = get_null_prop(azdias, 1, False)
customers_null_rows = get_null_prop(customers, 1, False)
azdias_null_rows.hist(bins=10, alpha=0.7, label='AZDIAS')
customers_null_rows.hist(bins=10, alpha=0.7, label='CUSTOMERS')
plt.title("Distribution of Null Values in Rows");
plt.legend()
plt.tight_layout()
plt.savefig("null_rows.png");
Since we touched on missing values: the Values sheet shows that many features also have 'unknown' values encoded as regular codes (such as -1 or 0), in addition to the genuine null values. These coded unknowns should be treated as missing as well.
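As a tiny illustration on a toy series (not the real data): counting only the literal NaNs understates the true missingness until the coded unknowns are mapped to null as well.

```python
import numpy as np
import pandas as pd

# toy feature where, as in the Values sheet, -1 and 0 encode "unknown"
s = pd.Series([1, 2, -1, 3, 0, np.nan])
print(s.isna().mean())        # counts only the true NaN
s_clean = s.replace({-1: np.nan, 0: np.nan})
print(s_clean.isna().mean())  # counts NaN plus coded unknowns
```

The `replace_unknown_with_null` function above does exactly this, but driven by the per-feature unknown codes from the Values sheet.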
# find the unknown values associated with each feature
feat_unknown_vals = dias_vals.query("Meaning == 'unknown'")
feat_unknown_vals.head()
# replace unknown values with null
azdias_new = replace_unknown_with_null(azdias)
azdias_new_null_cols = get_null_prop(azdias_new, plot=False)
azdias_new_null_rows = get_null_prop(azdias_new, axis=1, plot=False)
azdias_new_null_cols.hist(bins=20, alpha=0.7, label='New AZDIAS')
azdias_null_cols.hist(bins=20, alpha=0.7, label='Old AZDIAS')
plt.title("Distribution of Null Values in Columns");
plt.legend()
plt.tight_layout()
plt.savefig("null_columns_azdias.png");
azdias_new_null_rows.hist(bins=20, alpha=0.7, label='New AZDIAS')
azdias_null_rows.hist(bins=20, alpha=0.7, label='Old AZDIAS')
plt.title("Distribution of Null Values in Rows");
plt.legend()
plt.tight_layout()
plt.savefig("null_rows_azdias.png");
How to handle this will depend completely on the type of feature we are dealing with, as missing values can be divided into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
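A toy sketch of why the distinction matters (synthetic data, not the project datasets): under MCAR the observed mean stays roughly unbiased, while under MNAR, where for instance the high values are the ones that go missing, the observed mean is biased.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
income = pd.Series(rng.normal(50, 10, 10_000))

# MCAR: each value is dropped with the same probability, regardless of anything
mcar = income.mask(rng.random(10_000) < 0.2)
# MNAR: the value itself drives the missingness (top 20% never report)
mnar = income.mask(income > income.quantile(0.8))

print(income.mean(), mcar.mean(), mnar.mean())
```

Simple imputation or listwise deletion is defensible for MCAR, but can badly distort MNAR features.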
We need to take a look into the Attributes dataframe first.
dias_atts.head()
Information level is the column that holds the category, but it contains null values because the spreadsheet used merged cells. So we need to forward-fill it, then back-fill any leading null values.
dias_atts["Information level"] = dias_atts["Information level"].fillna(method="ffill", axis=0)
dias_atts["Information level"] = dias_atts["Information level"].fillna(method="bfill", axis=0)
dias_atts.head()
# remove features that aren't in Azdias
dias_atts = dias_atts.query("Attribute in @azdias_feats")
print("Number of feature categories:", dias_atts["Information level"].nunique())
print("\nFeature Categories:")
print(dias_atts["Information level"].value_counts())
def explore_category(category, azdias):
    """Prints description, null percentage and value counts of each feature in a specified category."""
    # query only features in category (copy to avoid SettingWithCopyWarning)
    cat_feats = dias_atts[dias_atts["Information level"] == category].copy()
    # calculate null percentage of each feature
    cat_feats["null_percentage"] = cat_feats["Attribute"].apply(lambda feat: azdias[feat].isna().sum() / azdias.shape[0])
    # sort by null percentage
    cat_feats = cat_feats.sort_values("null_percentage")
    print(f"Number of features in {category}: {len(cat_feats)}\n\n")
    for i, row in cat_feats.iterrows():
        feat = row["Attribute"]
        print(feat)
        print(row["Description"])
        print("Null percentage:", row["null_percentage"])
        print(azdias[feat].value_counts())
        print()
According to https://datarade.ai/data-products/plz8-germany-and-plz8-germany-xxl, Germany has been divided into 84,000 PLZ8 boundaries, so this category contains socio-economic data related to each PLZ8 boundary, which helps in optimizing the distribution of promotional materials — in our case, the mail-order sales.
explore_category("PLZ8", azdias_new)
plz8_feats = dias_atts[dias_atts["Information level"] == "PLZ8"]["Attribute"].unique()
plz8_azdias = azdias_new.loc[:, plz8_feats]
null_p = plz8_azdias.isna().sum()*100/plz8_azdias.shape[0]
null_p.value_counts().plot(kind="bar", title="Null Percentages in PLZ8 Features");
plt.tight_layout();
plt.savefig("null_plz8.png");
This rings a bell for MCAR or MNAR data: there is a pattern that is most likely unrelated to the observed data, but may or may not be related to unobserved data, which would explain why some persons simply don't have PLZ8 data collected.
To answer this question, we first need to know whether there is a difference between people who don't have PLZ8 data and those who do, as removing the data for a specific group of people might introduce bias into the model.
We can further investigate this difference using other features to test if the data was MNAR or MCAR.
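One simple probe, sketched here on synthetic stand-in data (with the real frames you would substitute azdias_new and a fully observed feature; the column names below are hypothetical): split the rows by whether a PLZ8 feature is missing and compare the distribution of an observed feature across the two groups. A clear difference points away from MCAR.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# toy frame: AGE is fully observed, PLZ8_FEAT is missing for a random subset
df = pd.DataFrame({
    "AGE": rng.integers(18, 90, 1000).astype(float),
    "PLZ8_FEAT": rng.normal(size=1000),
})
df.loc[rng.choice(1000, 150, replace=False), "PLZ8_FEAT"] = np.nan

# split by missingness and compare the AGE distribution of the two groups
missing = df["PLZ8_FEAT"].isna()
print(df.loc[missing, "AGE"].describe())
print(df.loc[~missing, "AGE"].describe())
```

If the summaries (or histograms) of the observed feature look alike across the two groups, MCAR remains plausible; a systematic shift suggests the missingness is related to observed data.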
tmp = azdias_new["KBA13_ANZAHL_PKW"]
tmp.plot(kind="hist", bins=50, title="KBA13_ANZAHL_PKW");
plt.tight_layout()
plt.savefig("KBA13_ANZAHL_PKW.png")
We can see that the bins start getting less granular as we exceed 1200. My guess is that this data was spread between 1300 and the maximum value, but the granularity of that section was decreased, which explains why the distribution is right-skewed but we start seeing bumps near the end.
We could leave it as is, as I don't think it would make much difference. Or we can follow the lead of that last section and reduce the granularity of the whole feature.
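One way to reduce the granularity of the whole feature is quantile binning with pandas; the sketch below uses a synthetic right-skewed series as a stand-in for KBA13_ANZAHL_PKW, which would be binned the same way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# synthetic right-skewed stand-in for a car-count feature
cars = pd.Series(rng.gamma(shape=3.0, scale=200.0, size=1000))

# collapse the raw counts into 10 equal-frequency ordinal bins
cars_binned = pd.qcut(cars, q=10, labels=False)
print(cars_binned.value_counts().sort_index())
```

Equal-frequency bins (qcut) keep the skewed tail from dominating a single bin, which equal-width bins (cut) would do here.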
explore_category("Microcell (RR3_ID)", azdias_new)
This is data about a collective of individuals, so we would expect to have some indicator of that collective, like an identifier for the person's microcell or their PLZ8 area.
I looked for any feature holding the postcode or the PLZ8 area, but I didn't find anything. Therefore I think that in handling missing data we should do the following:
explore_category("Person", azdias_new)
First, let's determine a threshold for the percentage of missing values a feature may have to be included in this test; that should be the highest recurring missing-value percentage we have seen so far, which is 16.6%.
# all features with missing percentage less than 17% and higher than 0%
null_prop = azdias_new.isnull().sum() / azdias.shape[0]
features_missing = azdias.columns[(null_prop < 0.17) & (null_prop > 0)]
# narrow down to only features of the explored categories
plz8_feats = dias_atts[dias_atts["Information level"] == "PLZ8"]["Attribute"].unique()
rr3_feats = dias_atts[dias_atts["Information level"] == "Microcell (RR3_ID)"]["Attribute"].unique()
person_feats = dias_atts[dias_atts["Information level"] == "Person"]["Attribute"].unique()
features_missing = list(set(plz8_feats).union(rr3_feats).union(person_feats).intersection(features_missing))
# azdias with only features that have missing values
azdias_missing = azdias_new[features_missing]
# flag rows with all missing values
rows_all_missing = azdias_missing.isna().sum(axis=1) == azdias_missing.shape[1]
print("Number of rows with all missing values:", rows_all_missing.sum())
print("Percentage of rows with all missing values:", rows_all_missing.sum()/azdias_missing.shape[0])
We can see that there are 7 rows where all of the explored features are missing, and we already know that there are rows with a high percentage of missing features. But how many are there exactly?
rows_missing_p = azdias_missing.isnull().sum(axis=1)/azdias_missing.shape[1]
for i in np.arange(0, 0.8, 0.1):
    print("Proportion of rows with more than {:.0f}% values missing: {}".format(i*100, (rows_missing_p > i).sum()/azdias.shape[0]))
So let's look at this information using the whole dataset.
rows_missing_p = azdias_new.isnull().sum(axis=1)/azdias_new.shape[1]
for i in np.arange(0, 0.8, 0.1):
    print("Percentage of rows with more than {:.2f}% values missing: {}".format(i*100, (rows_missing_p > i).sum()/azdias.shape[0]))
# calculate features missing values
feat_missing_count = azdias_new.isna().sum()
# filter out rows with more than 50% missing values
half_missing_rows = azdias_new[rows_missing_p > 0.5]
# transpose the dataframe to make the features as index
half_missing_rows_t = half_missing_rows.T
# for each feature, the share of its null values that falls in these rows
half_missing_rows_t["null_percentage"] = half_missing_rows_t.isna().sum(axis=1) / feat_missing_count
# sort values by null_percentage
half_missing_rows_t = half_missing_rows_t.sort_values("null_percentage", ascending=False)
# select features that have null values
half_missing_rows_t = half_missing_rows_t.query("null_percentage > 0")
# print each feature, category and percent of missing values
category_missing = defaultdict(list)
category_missing_p = defaultdict(list)
category_missing_count = defaultdict(int)
for feat, p in half_missing_rows_t["null_percentage"].iteritems():
    if feat in set(dias_atts.Attribute):
        category = dias_atts.query("Attribute == @feat")["Information level"].item()
    else:
        category = "Unknown"
    category_missing_p[category].append(p)
    if p > 0.9:
        category_missing[category].append(feat)
        category_missing_count[category] += 1
    print(feat, category, p)
This doesn't tell us much about which feature categories are most affected by these rows, except that the majority of these features have more than 60% of their null values concentrated in these rows.
plt.figure(figsize=(10, 4))
for category, p in category_missing_p.items():
    plt.hist(p, alpha=0.5, label=category)
plt.title("Distribution of Null Values Ratio between Data with >= 50% Missing Values and Full Data by Category")
plt.xlabel("Null Ratio")
plt.legend()
plt.tight_layout()
plt.savefig("null_ratio_1.png");
category_mean_p = pd.Series({category: np.mean(p) for category, p in category_missing_p.items()}).sort_values(ascending=False)
category_mean_p.plot(kind="bar", title="Mean Null Values Ratio between Data with >= 50% Missing Values and Full Data");
plt.xlabel("Categories")
plt.ylabel("Mean Null Ratio")
plt.tight_layout()
plt.savefig("null_ratio_2.png");
# first we need to add the original count of Unknown features
original_category_count = dias_atts["Information level"].value_counts()
original_category_count["Unknown"] = len(azdias_feats.difference(dias_atts_feats))
original_category_count = original_category_count.sort_values(ascending=False)
category_half_missing_count = pd.Series(category_missing_count).sort_values(ascending=False)
category_half_missing_count[original_category_count.index].plot(kind="bar", alpha=0.5, color='blue', label="Missing rows")
original_category_count.plot(kind="bar", alpha=0.5, color='green', label="All", figsize=(9, 4))
plt.xticks(rotation=90);
plt.legend()
plt.title("Number of Features with More than 90% Missing Values in Rows with More than 50% Missing Values")
plt.tight_layout()
plt.savefig("drop_missing_rows.png")
explore_category("Household", azdias_new)
explore_category("Microcell (RR4_ID)", azdias_new)
explore_category("Building", azdias_new)
explore_category("RR1_ID", azdias_new)
explore_category("Postcode ", azdias_new)
explore_category("Community", azdias_new)
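The `explore_category` helper called throughout this section is defined outside this excerpt. Based on how features are filtered by `Information level` elsewhere in the notebook, it presumably summarizes missingness per category; a minimal sketch (the explicit `dias_atts` parameter is an assumption for self-containment; the notebook version likely reads it from the enclosing scope):

```python
import pandas as pd

def explore_category(category, df, dias_atts):
    """Summarize null percentages for every feature of a given Information level."""
    # features documented under this category in the attributes sheet
    feats = dias_atts[dias_atts["Information level"] == category]["Attribute"].unique()
    # keep only those actually present in the dataframe
    feats = [f for f in feats if f in df.columns]
    # null percentage per feature, highest first
    null_p = (df[feats].isna().sum() / df.shape[0]).sort_values(ascending=False)
    print(f"{category}: {len(feats)} features")
    print(null_p)
    return null_p
```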
Since we have no description for these features, let's look through them to figure out what they mean and decide which are worth keeping and which are not.
# find features that don't have categories and transpose df so that features are index
no_cat_feats = azdias_new[list(set(azdias_feats).difference(dias_atts_feats))].T
# calculate null_percentage of features
no_cat_feats["null_percentage"] = no_cat_feats.isna().sum(axis=1)/no_cat_feats.shape[1]
# sort by null percentage
no_cat_feats.sort_values("null_percentage")
print(f"Number of features of unknown category: {len(no_cat_feats)}\n\n")
for feat, p in no_cat_feats["null_percentage"].iteritems():
    print(feat)
    print("Null percentage:", p)
    print(azdias_new[feat].value_counts())
    print()
Looking at the first feature, I can see that a lot of missing values are encoded as 0, and since these features weren't included in the Values sheet, their unknown values weren't changed to null when we converted the rest of the features. So let's change them and take another look.
# find features that don't have categories
no_cat_feats = azdias_new[list(set(azdias_feats).difference(dias_atts_feats))].copy()
# change 0 and -1 to null in non-binary features
for feat in no_cat_feats.columns:
    if no_cat_feats[feat].nunique() > 2:
        no_cat_feats[feat].replace(-1, np.nan, inplace=True)
        no_cat_feats[feat].replace(0, np.nan, inplace=True)
# transpose df so that features are index
no_cat_feats = no_cat_feats.T
# calculate null_percentage of features
no_cat_feats["null_percentage"] = no_cat_feats.isna().sum(axis=1)/no_cat_feats.shape[1]
# sort by null percentage
no_cat_feats.sort_values("null_percentage", inplace=True)
print(f"Number of features of unknown category: {len(no_cat_feats)}\n\n")
for feat, p in no_cat_feats["null_percentage"].iteritems():
    print(feat)
    print("Null percentage:", p)
    print(azdias_new[feat].value_counts())
    print()
Note that this time we have counted 0 and -1 towards the null percentage, but they still appear in the value counts. That is deliberate: it gives us the ability to judge whether 0 really is a missing value in some features.
Right now I'll try to find the translation of features if they don't have a huge percentage of missing values.
%%time
# calculate all features null correlation
azdias_null_corr = azdias_new.replace(0, np.nan).isna().astype(int).corr()
# top 50 features correlated with null values in EINGEFUEGT_AM
azdias_null_corr["EINGEFUEGT_AM"].sort_values(ascending=False)
# make mappings from feature to category
from collections import defaultdict
cats = ["PLZ8",
"Microcell (RR3_ID)",
"Person",
"Household",
"Microcell (RR4_ID)",
"Building",
"RR1_ID",
"Postcode",
"Community"]
mappings = {feat: cat for cat in cats for feat in dias_atts[dias_atts["Information level"] == cat].Attribute}
unknown_mappings = {feat: "Unknown" for feat in no_cat_feats.index}
# plot distributions of feature categories null values correlation with EINGEFUEGT_AM
eing_null_corr = azdias_null_corr["EINGEFUEGT_AM"].reset_index().replace({"index": mappings})
eing_null_corr = eing_null_corr.replace({"index": unknown_mappings})
plt.figure(figsize=(8, 6))
ax = plt.subplot()
for cat, cat_null in eing_null_corr.groupby('index'):
    cat_null.hist(label=cat, alpha=0.5, ax=ax)
plt.legend()
plt.tight_layout();
We can see that EINGEFUEGT_AM shares its null values with features in different categories, most prominently Building, PLZ8, Household, and some Unknown features. Beyond those there is no clear pattern, as it is likely that other features simply share null values in the same rows, so there is no insight we can act upon.
def replace_unknown_with_null(df):
    """
    Replace unknown values encoded as 0 and -1 with null values.
    """
    # Read the Values sheet, which documents the unknown encodings
    dias_vals = pd.read_excel("DIAS Attributes - Values 2017.xlsx")
    # Find unknown values for features
    feat_unknown_vals = dias_vals.query("Meaning == 'unknown'")
    # Replace unknown values for each feature in the dataframe
    for feat in feat_unknown_vals.itertuples():
        # Check that the feature is among the dataframe's features
        if feat.Attribute in df.columns:
            # if there is more than one unknown value
            if ',' in str(feat.Value):
                # loop over unknown values
                for val in str(feat.Value).split(','):
                    # replace unknown value with null
                    df[feat.Attribute].replace(int(val), np.nan, inplace=True)
            else:
                # replace unknown value with null
                df[feat.Attribute].replace(feat.Value, np.nan, inplace=True)
    # Replace other unknown values that aren't in the Values sheet
    df["ALTER_HH"].replace(0, np.nan, inplace=True)
    df["KOMBIALTER"].replace(0, np.nan, inplace=True)
    # Replace 0 and -1 with null in non-binary features of unknown category
    unknown_feats = df[list(set(df.columns).difference(dias_vals.Attribute))].copy()
    for feat in unknown_feats:
        if df[feat].nunique() > 2:
            df[feat].replace(-1, np.nan, inplace=True)
            df[feat].replace(0, np.nan, inplace=True)
    return df
def clean_dataset(df, p_row=0.5, p_col=0.5, drop_uncertain=True, keep_features=[]):
    """
    Clean dataset using insights gained during EDA.
    inputs
    1. df (pandas dataframe)
    2. p_row (float) - maximum allowed proportion of null values in a row
    3. p_col (float) - maximum allowed proportion of null values in a column
    4. drop_uncertain (bool) - drop features we are uncertain about
    5. keep_features (list) - features to keep even if above the null threshold
    """
    # Make a new copy of the dataframe
    clean_df = df.copy()
    # Clean columns with mixed dtypes
    clean_df["CAMEO_DEUG_2015"].replace('X', np.nan, inplace=True)
    clean_df["CAMEO_INTL_2015"].replace('XX', np.nan, inplace=True)
    # Replace unknown values with missing
    clean_df = replace_unknown_with_null(clean_df)
    # Drop rows with more than p_row missing values
    min_count = int((1 - p_row) * clean_df.shape[1] + 1)
    clean_df.dropna(axis=0, thresh=min_count, inplace=True)
    # Drop duplicated rows
    clean_df.drop_duplicates(inplace=True)
    # Drop GEBURTSJAHR (year of birth), which has 44% missing values
    clean_df.drop('GEBURTSJAHR', axis=1, inplace=True)
    # Drop LNR, which is a unique identifier
    clean_df.drop('LNR', axis=1, inplace=True)
    # Drop CAMEO_DEU_2015 as it's not suitable for PCA
    clean_df.drop('CAMEO_DEU_2015', axis=1, inplace=True)
    # Drop features with more than p_col missing values
    features_missing_p = clean_df.isna().sum() / clean_df.shape[0]
    features_above_p_thresh = clean_df.columns[features_missing_p > p_col]
    features_to_keep = ["ALTER_KIND1", "ALTER_KIND2", "ALTER_KIND3", "ALTER_KIND4", "ANZ_KINDER"] + keep_features
    features_to_remove = [feat for feat in features_above_p_thresh if feat not in features_to_keep]
    clean_df.drop(features_to_remove, axis=1, inplace=True)
    # Drop uncertain features
    if drop_uncertain:
        uncertain_features = ["GEMEINDETYP"]
        clean_df.drop(uncertain_features, axis=1, inplace=True)
    # Feature Engineering
    # One-hot encode D19_LETZTER_KAUF_BRANCHE
    dummies = pd.get_dummies(clean_df["D19_LETZTER_KAUF_BRANCHE"], prefix="D19_LETZTER_KAUF_BRANCHE")
    clean_df = pd.concat([clean_df, dummies], axis=1)
    clean_df.drop("D19_LETZTER_KAUF_BRANCHE", axis=1, inplace=True)
    # Calculate year difference in MIN_GEBAEUDEJAHR
    clean_df["MIN_GEBAEUDEJAHR_ENG"] = 2017 - clean_df["MIN_GEBAEUDEJAHR"]
    clean_df.drop("MIN_GEBAEUDEJAHR", axis=1, inplace=True)
    # Calculate day difference in EINGEFUEGT_AM
    current = datetime.strptime("2017-01-01", "%Y-%m-%d")
    clean_df["EINGEFUEGT_AM_DAY"] = (current - pd.to_datetime(clean_df["EINGEFUEGT_AM"])).dt.days
    clean_df.drop("EINGEFUEGT_AM", axis=1, inplace=True)
    # Replace null values in ALTER_KIND* and ANZ_KINDER with 0 to avoid imputation
    for feat in clean_df.columns[clean_df.columns.str.startswith("ALTER_KIND")]:
        clean_df[feat].replace(np.nan, 0, inplace=True)
    clean_df["ANZ_KINDER"].replace(np.nan, 0, inplace=True)
    # Convert OST_WEST_KZ to binary labels
    clean_df["OST_WEST_KZ"] = (clean_df["OST_WEST_KZ"] == "W").astype(np.uint8)
    # Convert CAMEO_INTL_2015 and CAMEO_DEUG_2015 to float32
    CAMEO_feats = ["CAMEO_INTL_2015", "CAMEO_DEUG_2015"]
    clean_df[CAMEO_feats] = clean_df[CAMEO_feats].astype(np.float32)
    # Convert float16 features to float32 to enable arithmetic operations
    float_feats = clean_df.select_dtypes(np.float16).columns
    clean_df[float_feats] = clean_df[float_feats].astype(np.float32)
    return clean_df
clean_azdias = clean_dataset(azdias)
# pickle clean AZDIAS
pd.to_pickle(clean_azdias, "clean_azdias.pkl")
# Delete AZDIAS to free up memory
del azdias
gc.collect()
clean_azdias = pd.read_pickle("clean_azdias.pkl")
features_missing = (clean_azdias.isna().sum() / clean_azdias.shape[0]).sort_values(ascending=False)
for feat, p in features_missing.iteritems():
    print(feat)
    print(p)
    print(clean_azdias[feat].value_counts())
    print()
features_missing.plot(kind="hist", bins=20, title="Histogram of Missing Values Proportions in Features after Cleaning");
plt.tight_layout()
plt.savefig("missing_after_cleaning.png")
We can now see that imputation won't be a big deal: only a very small number of features exceed 10% missing values.
My gut feeling says to go with KNN imputation, as features of the same category could be a huge aid in determining which value to impute, instead of imputing with 0 or the mode (we are dealing with ordinal data).
But KNN imputation is time- and memory-intensive, so we can't run it on all features. The majority of features should therefore be imputed with an easy strategy like mode imputation below a threshold, and only a small number of features imputed using KNN above it.
# Null percentages in features
na_features = clean_azdias.isna().sum() / clean_azdias.shape[0]
# Filter out features that have no null values
na_features = na_features[na_features > 0]
na_features.plot(kind="box");
plt.title("Box Plot of Null Percentages in Features");
plt.ylabel("Percentage")
plt.tight_layout()
plt.savefig("missing_after_cleaning_box.png");
This box plot suggests imputing only the features above 10% with KNN. To do that, we first need to impute the features below 10% using mode imputation, then impute the rest using KNN.
Let's test this separately first to ensure it would work.
# Features below 10% null percentage
features_below_10 = na_features[na_features < 0.1].index
# Features indexes
features_below_10_idx = np.argwhere(clean_azdias.columns.isin(features_below_10)).ravel()
# Most frequent imputer using ColumnTransformer
# mode_imputer = ColumnTransformer([
# ('mode', SimpleImputer(strategy="most_frequent"), features_below_10_idx)
# ], remainder="passthrough")
%%time
# mode_azdias = mode_imputer.fit_transform(clean_azdias)
# mode_azdias = pd.DataFrame(mode_azdias, index=clean_azdias.index, columns=clean_azdias.columns)
# Features above 10% null percentage
features_above_10 = na_features[na_features > 0.1].index
# Features indexes
features_above_10_idx = np.argwhere(clean_azdias.columns.isin(features_above_10)).ravel()
# Check number of features with missing value equals features above 10%
# mode_azdias.isna().any(axis=0).sum() == len(features_above_10)
# Check dtype
# mode_azdias.info()
# Change dtype to float32
# mode_azdias = mode_azdias.astype(np.float32)
# %%time
# # Impute the rest of the features using KNN Imputer
# knn_imputer = KNNImputer()
# mode_azdias = knn_imputer.fit_transform(mode_azdias)
# mode_azdias = pd.DataFrame(mode_azdias, index=clean_azdias.index, columns=clean_azdias.columns)
So instead, I'll impute the remaining 10 features using mean imputation, to avoid introducing an artificial spike in the data, since these features might have more than 20% missing values.
%%time
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

imputer = ColumnTransformer([
    ('mode', SimpleImputer(strategy="most_frequent"), features_below_10),
    ('mean', SimpleImputer(strategy="mean"), features_above_10)
], remainder="passthrough")
imputed_azdias = imputer.fit_transform(clean_azdias).astype(np.float32)
# Imputed features as arranged in ColumnTransformer
imputed_features = list(features_below_10) + list(features_above_10)
# Features that weren't transformed
rem_features = [feat for feat in clean_azdias.columns if feat not in imputed_features]
# New arrangement of features which is outputted by ColumnTransformer
new_features = imputed_features + rem_features
# Convert imputed_azdias into DataFrame object for further analysis
imputed_azdias = pd.DataFrame(imputed_azdias, index=clean_azdias.index, columns=new_features)
# Sort features using null percentages
na_features = na_features.sort_values(ascending=False)
# Loop over features
for feat, p in na_features.iteritems():
    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(15, 4))
    # Feature before imputation (null values filled with 0 so they show up in the plot)
    null_feat = clean_azdias[feat].fillna(0)
    # Feature after imputation
    imputed_feat = imputed_azdias[feat]
    # Plot a histogram if the feature has more than 11 unique values
    if null_feat.nunique() > 11:
        null_feat.plot(kind="hist", bins=20, color="tab:blue", ax=axes[0])
    # Plot a bar chart otherwise
    else:
        null_feat.value_counts().sort_index().plot(kind="bar", color="tab:blue", ax=axes[0])
    axes[0].set_title("{} - {:.2f}%".format(feat, p*100))
    if imputed_feat.nunique() > 11:
        imputed_feat.plot(kind="hist", bins=20, color="tab:blue", ax=axes[1])
    else:
        imputed_feat.value_counts().sort_index().plot(kind="bar", color="tab:blue", ax=axes[1])
    axes[1].set_title(f"{feat} After Imputation")
    plt.show()
# save imputed azdias in pickle format
pd.to_pickle(imputed_azdias, "imputed_azdias.pkl")
# save imputer in pickle format
pickle.dump(imputer, open("imputer.pkl", "wb"))
The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.
# load imputed azdias dataset
imputed_azdias = pd.read_pickle("imputed_azdias.pkl")
# load imputer
imputer = pickle.load(open("imputer.pkl", "rb"))
# scaler = StandardScaler()
# pca = IncrementalPCA()
# %%time
# scaler.fit(imputed_azdias)
After trying to fit the scaler on the whole dataset, it turns out that it takes a really long time, and we are better off fitting on batches. So let's write functions that make fitting any sklearn transformer in batches easy.
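The `batch_fit_scaler`, `batch_fit_pca`, and `batch_transform_pca` helpers used below aren't shown in this excerpt. A minimal sketch of what they presumably do, assuming they wrap `partial_fit` on transformers that support it (`StandardScaler` and `IncrementalPCA` both do):

```python
import numpy as np

def batch_fit_scaler(scaler, df, n_batches=100):
    """Fit a scaler incrementally, one batch of rows at a time."""
    batch_size = int(np.ceil(df.shape[0] / n_batches))
    for start in range(0, df.shape[0], batch_size):
        scaler.partial_fit(df.iloc[start:start + batch_size])
    return scaler

def batch_fit_pca(pca, scaler, df, n_batches=100):
    """Fit an IncrementalPCA on scaled batches (each batch must have
    at least n_components rows for partial_fit to work)."""
    batch_size = int(np.ceil(df.shape[0] / n_batches))
    for start in range(0, df.shape[0], batch_size):
        pca.partial_fit(scaler.transform(df.iloc[start:start + batch_size]))
    return pca

def batch_transform_pca(pca, scaler, df, n_batches=100):
    """Scale and project the data batch by batch, then stack the results."""
    batch_size = int(np.ceil(df.shape[0] / n_batches))
    out = [pca.transform(scaler.transform(df.iloc[start:start + batch_size]))
           for start in range(0, df.shape[0], batch_size)]
    return np.vstack(out)
```

Because both transformers are fitted via `partial_fit`, only one batch ever needs to be in flight, which keeps the memory footprint small on the full AZDIAS dataset.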
# scaler = batch_fit_scaler(scaler, imputed_azdias)
# pca = batch_fit_pca(pca, scaler, imputed_azdias)
# pickle.dump(scaler, open("scaler.pkl", "wb"))
# pickle.dump(pca, open("pca.pkl", "wb"))
scaler = pickle.load(open("scaler.pkl", "rb"))
pca = pickle.load(open("pca.pkl", "rb"))
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print("Minimum number of dimensions that have 95% of original data variance:", d)
# Transform and keep only d dimensions
pca_azdias = batch_transform_pca(pca, scaler, imputed_azdias)[:, :d]
K = range(1, 10)
K_inertia = []
init = "k-means++"
n_init = 10
n_batches = 100
batch_size = pca_azdias.shape[0]//n_batches
seed = 42
for k in K:
    print(f"Running with k = {k}")
    kmeans = MiniBatchKMeans(n_clusters=k, init=init, n_init=n_init, batch_size=batch_size, random_state=seed)
    kmeans.fit(pca_azdias)
    K_inertia.append(kmeans.inertia_)
    print(f"Inertia: {kmeans.inertia_}\n")
plt.plot(K, K_inertia, linestyle='-', marker='o')
plt.title('Inertia as a Function of k Clusters')
plt.xlabel('k Clusters')
plt.ylabel('Inertia')
plt.xticks(K, K)
plt.tight_layout()
plt.savefig("kmean_inertia.png");
k = 7
init = "k-means++"
n_init = 10
n_batches = 100
batch_size = pca_azdias.shape[0]//n_batches
seed = 42
kmeans = MiniBatchKMeans(n_clusters=k, init=init, n_init=n_init, batch_size=batch_size, random_state=seed)
kmeans.fit(pca_azdias)
pickle.dump(kmeans, open("kmeans.pkl", "wb"))
kmeans = pickle.load(open("kmeans.pkl", "rb"))
# Calculate percentage of each cluster in azdias
azdias_clusters_p = pd.Series(kmeans.labels_).value_counts()/len(kmeans.labels_)
# Plot the azdias clusters percentages
azdias_clusters_p.plot(kind="bar");
plt.title("Clusters of AZDIAS")
plt.xlabel("Cluster")
plt.ylabel("Percentage");
# Clean the customers dataset using the pipeline we made earlier
clean_customers = clean_dataset(customers, keep_features=["KBA13_ANTG4"])
# Re-arrange columns to be just as clean_azdias to avoid any problems with imputer
clean_customers = clean_customers[clean_azdias.columns]
# Impute missing values using imputer fitted on clean_azdias
imputed_customers = imputer.transform(clean_customers)
# transform using scaler
scaled_customers = scaler.transform(imputed_customers)
# transform using pca
pca_customers = pca.transform(scaled_customers)[:, :d]
# Predict KMeans labels
customers_clusters = kmeans.predict(pca_customers)
# Calculate percentage of each cluster in customers
customers_clusters_p = pd.Series(customers_clusters).value_counts()/len(customers_clusters)
# Plot the customers clusters percentages
customers_clusters_p.plot(kind="bar");
plt.title("Clusters of CUSTOMERS")
plt.xlabel("Cluster")
plt.ylabel("Percentage");
This question is better answered with a bar chart whose hue indicates whether the cluster percentage comes from AZDIAS or CUSTOMERS. To do that, we need a dataframe with one column for AZDIAS and another for CUSTOMERS.
# Concatenate azdias and customers cluster percentages
clusters_p = pd.concat([azdias_clusters_p, customers_clusters_p], axis=1)*100
# Rename columns to AZDIAS and CUSTOMERS
clusters_p.columns = ["AZDIAS", "CUSTOMERS"]
clusters_p.sort_values(["CUSTOMERS", "AZDIAS"], ascending=False).plot(kind="bar", color=["grey", "tab:blue"])
plt.title("How Are Clusters Represented in Customer and General Population?")
plt.xlabel("Cluster")
plt.ylabel("Percentage %")
plt.tight_layout()
plt.savefig("clusters.png");
# Filtering individuals from cluster 0 in AZDIAS
cluster_0 = clean_azdias[kmeans.labels_ == 0]
# Checking the dimensions of the dataframe
cluster_0.shape
# Taking a quick look at the statistics of the cluster
cluster_0.describe()
As we can see, it's hard to get any insight from these statistics alone, so it would be better if we were able to visualize all features. I'll make a function that visualizes all the features of a given cluster.
# Make an index for each category in order to sort them during visualization
cats = ["Person",
"Household",
"Building",
"Community",
"Postcode",
"PLZ8",
"RR1_ID",
"Microcell (RR3_ID)",
"Microcell (RR4_ID)",
"Unknown"]
cats_order = {cat: i for i, cat in enumerate(cats)}
cats_order
feat_cat_tuples = []
known_feats = set(dias_atts.Attribute)
# Match each feature with category and assign unknown if no category exists
for feat in clean_azdias.columns:
    cat = "Unknown"
    if feat in known_feats:
        cat = dias_atts[dias_atts["Attribute"] == feat]["Information level"].item().strip()
    feat_cat_tuples.append((feat, cat))
# Make dataframe from resulting tuples of feature and category
feat_cat_df = pd.DataFrame(feat_cat_tuples, columns=["feature", "category"])
# Assign category order for each category
feat_cat_df["cat_order"] = feat_cat_df["category"].map(cats_order)
# Sort by category, then alphabetically by feature
feat_cat_df.sort_values(by=["cat_order", "feature"], inplace=True)
feat_cat_df.head()
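The `compare_features` function used next is not defined in this excerpt. A minimal sketch of what it presumably does, plotting each feature's normalized distribution for several dataframes side by side (the `features` and `max_cats` parameters are assumptions; the notebook version likely iterates `feat_cat_df` in category order):

```python
import pandas as pd
import matplotlib.pyplot as plt

def compare_features(dfs, labels, features=None, max_cats=11):
    """For every feature, plot normalized bar charts for low-cardinality
    features and density histograms otherwise, one curve/bar set per df."""
    if features is None:
        features = dfs[0].columns
    for feat in features:
        fig, ax = plt.subplots(figsize=(10, 3))
        if dfs[0][feat].nunique() > max_cats:
            # high-cardinality: overlay density histograms
            for df, label in zip(dfs, labels):
                df[feat].plot(kind="hist", bins=20, density=True,
                              alpha=0.5, label=label, ax=ax)
            ax.legend()
        else:
            # low-cardinality: grouped bars of normalized value counts
            counts = pd.concat(
                [df[feat].value_counts(normalize=True).rename(label)
                 for df, label in zip(dfs, labels)],
                axis=1).sort_index()
            counts.plot(kind="bar", ax=ax)
        ax.set_title(feat)
        plt.show()
```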
compare_features(dfs=[cluster_0, clean_azdias], labels=["Cluster 0", "AZDIAS"])
# Filtering individuals from clusters 4 and 6 in AZDIAS
cluster_4 = clean_azdias[kmeans.labels_ == 4]
cluster_6 = clean_azdias[kmeans.labels_ == 6]
compare_features([cluster_0, cluster_4, cluster_6, clean_azdias], ["Cluster 0", "Cluster 4", "Cluster 6", "Population"])
compare_features([cluster_4, clean_azdias], ["Cluster 4", "Population"])
compare_features([cluster_6, clean_azdias], ["Cluster 6", "Population"])
cluster_5 = clean_azdias[kmeans.labels_ == 5]
compare_features([cluster_0, cluster_5], ["Cluster 0", "Cluster 5"])
feats_to_drop = ["D19_BANKEN_DATUM", "D19_BANKEN_OFFLINE_DATUM", "D19_BANKEN_ONLINE_DATUM", "D19_TELKO_DATUM",
"D19_TELKO_OFFLINE_DATUM", "D19_TELKO_ONLINE_DATUM", "D19_VERSI_DATUM", "D19_VERSI_OFFLINE_DATUM",
"D19_VERSI_ONLINE_DATUM", "ANZ_HH_TITEL", "KBA05_MODTEMP", "BALLRAUM", "INNENSTADT", "EWDICHTE",
]
There are also some features that show no difference between the two groups, specifically features related to motor vehicle information in PLZ8 areas or microcells. This indicates that we might want to apply some sort of variance thresholding before passing the data to a machine learning algorithm.
The way I'll do this is to partition the clusters with a high tendency to become customers into one dataframe and the ones with a low tendency into another, then visualize the stated features between them.
I'll also make a separate plot that dissects each group's clusters, in order to get an idea of the differences between them.
clusters_p.sort_values(["CUSTOMERS", "AZDIAS"], ascending=False).plot(kind="bar", color=["grey", "tab:blue"]);
plt.tight_layout()
plt.title("How Are Clusters Represented in Customer and General Population?");
My intuition is that we should visualize the differences between clusters 0, 4, and 6 and clusters 2, 3, and 5, leaving out cluster 1, since individuals in that cluster have very similar tendencies of being customers or non-customers, which would blunt the differences between the two groups.
# Group clusters with a high customer tendency (note: this reuses the name of the raw CUSTOMERS dataframe)
customers = clean_azdias[np.in1d(kmeans.labels_, [0, 4, 6])]
cluster_0 = clean_azdias[kmeans.labels_ == 0]
cluster_4 = clean_azdias[kmeans.labels_ == 4]
cluster_6 = clean_azdias[kmeans.labels_ == 6]
non_customers = clean_azdias[np.in1d(kmeans.labels_, [2, 3, 5])]
cluster_2 = clean_azdias[kmeans.labels_ == 2]
cluster_3 = clean_azdias[kmeans.labels_ == 3]
cluster_5 = clean_azdias[kmeans.labels_ == 5]
While plotting, we need to map feature values to their meanings in order to make the plots easier to understand. So let's check the DIAS Values sheet.
dias_vals.head(10)
We need to forward fill Attribute in order to map Value to Meaning.
# ffill Attribute
dias_vals["Attribute"].fillna(method="ffill", inplace=True)
dias_vals.head()
We also need to change multiple values in Value into one.
# change multiple values in Value into one value
dias_vals["Value"] = dias_vals["Value"].replace({"-1, 0": 0, "-1, 9": 9})
dias_vals.head(10)
Now we need to figure out how to build, for each feature, a mapping between its values and their meanings.
feat = "ALTERSKATEGORIE_GROB"
# filter the Values sheet down to this feature
feat_vals = dias_vals[dias_vals["Attribute"] == feat]
# make a series of Value and Meaning with Value as index
feat_vals_meaning = feat_vals[["Value", "Meaning"]].set_index("Value")
# convert series to dict
feat_vals_meaning = feat_vals_meaning.to_dict()["Meaning"]
feat_vals_meaning
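The `compare_feature` helper called below isn't defined in this excerpt either. A sketch of what it presumably does, combining the value-to-meaning mapping above with normalized bar plots; the `groups`, `labels`, and `dias_vals` parameters are assumptions for self-containment (the notebook calls it with only the feature, title, and figsize, so it likely closes over the customer/non-customer cluster dataframes and `dias_vals`):

```python
import pandas as pd
import matplotlib.pyplot as plt

def compare_feature(feat, title, groups, labels, dias_vals, figsize=(12, 5)):
    """Plot normalized value counts of one feature for several groups,
    relabeling the x-axis with meanings from the Values sheet."""
    # {value: meaning} mapping for this feature
    feat_vals = dias_vals[dias_vals["Attribute"] == feat]
    mapping = feat_vals.set_index("Value")["Meaning"].to_dict()
    # normalized distribution of the feature per group
    counts = pd.concat(
        [g[feat].value_counts(normalize=True).rename(label)
         for g, label in zip(groups, labels)],
        axis=1).sort_index()
    # replace numeric codes with their human-readable meanings
    counts.index = [mapping.get(v, v) for v in counts.index]
    counts.plot(kind="bar", figsize=figsize)
    plt.title(title)
    plt.ylabel("Proportion")
```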
compare_feature("ALTERSKATEGORIE_GROB", title="Age", figsize=(12, 5))
plt.tight_layout()
plt.savefig("Age.png")
We can see that customers have a greater probability of being older, with almost 80% above 45 years old. Non-customers, on the other hand, tend to have more than 50% under 46 years old. The age group most evenly shared between the two is the 46-60 group.
We can see that cluster 4 stands out with a higher percentage of individuals under 46, while cluster 0 has more than 90% of its population above 46 years old.
So cluster 0 is mostly elders, with the majority above 60 years old; cluster 4 has its majority above 45 but also a higher-than-average percentage of younger individuals; and cluster 6 is similar to cluster 0 except that its share of 46-60-year-olds is larger.
compare_feature("ANREDE_KZ", "Gender", figsize=(12, 4))
plt.tight_layout()
plt.savefig("Gender.png")
The percentage of males among customers is higher than among non-customers, while in both groups the percentage of females is higher than that of males.
Cluster 0 has an over-representation of males: its male percentage is higher than in all other clusters and higher than the female percentage within the same cluster. Clusters 4 and 6 have higher female percentages than cluster 0.
compare_feature("CJT_GESAMTTYP", "Channels", figsize=(12, 5))
# plt.tight_layout()
plt.savefig("Channels.png", bbox_inches="tight")
We can see that customers exceed non-customers in the percentages of advertising- and consumption-minimalists and traditionalists, while non-customers tend to sit at the more open end of that spectrum.
Since cluster 0 mostly represents elderly individuals, it's expected that they are over-represented among the minimalists and traditionalists. And since cluster 4 represents the younger customers, we don't see a lot of them as minimalists and traditionalists. Finally, cluster 6 has the most uniform distribution across the spectrum.
compare_feature("FINANZTYP", "Financial Type", figsize=(12, 5))
plt.tight_layout()
plt.savefig("Financial.png")
20% of customers are money savers, another 20% are investors, and around 35% are unremarkable, which I take to mean they have no specific financial type. Non-customers, on the other hand, tend to have low financial interest.
We can see that the majority of cluster 0 individuals with a distinguishable financial type are money savers, while in cluster 6 they are investors. Cluster 4 doesn't show a specific type.
compare_feature("LP_LEBENSPHASE_GROB", "Life Stage", figsize=(12, 6))
# plt.tight_layout()
plt.savefig("Life_Stage.png", bbox_inches="tight")
The most frequent non-customer type is single low-income and average earners of younger age, while customers' most frequent type is single high-income earners. However, there is no great difference between customers' most frequent value and the two next most frequent values, which points to the differences between the clusters.
Around 70% of cluster 6 are single, the majority of them single low-income and average earners of higher age. The most frequent type in cluster 0 is single high-income earners, while cluster 4's most frequent type is high-income earners of higher age from multi-person households. The remaining majority of cluster 4 falls into younger-aged families with different levels of income.
compare_feature("RETOURTYP_BK_S", "Return Type", figsize=(12, 6))
plt.tight_layout()
plt.savefig("Return.png");
The most frequent type among customers is the determined minimal returner, which I think means these individuals aren't the shopping type: they only buy what they need when they need it. The second most frequent type is the incentive-receptive normal returner. Among non-customers, the most frequent type is the influenceable crazy shopper, who definitely wouldn't be interested in mail-order catalogs.
First off, we can see that clusters 0 and 6 populate most of the customers in the determined minimal returner category, which makes sense since they are older individuals whom we found to be consumption minimalists and traditionalists. Cluster 4, on the other hand, populates every other category more frequently than the determined minimal returner one, with its most frequent type being the demanding heavy returner.
compare_feature("ALTER_HH", "Main Age within Household", figsize=(12, 5))
plt.tight_layout()
plt.savefig("Main_Age.png")
We have already investigated the age difference between customers and non-customers, and we can see that the main age within the household also differs between the two groups: customer households tend to be older, while non-customer households tend to be younger.
We can see that cluster 4 is the main cluster populating the younger ages among the customer clusters, while clusters 0 and 6 have nearly identical distributions representing the elderly segments of the customers.
compare_feature("HH_EINKOMMEN_SCORE", "Net Household Income", figsize=(12, 4))
plt.tight_layout()
plt.savefig("Net_Household.png");
We can see a huge difference between customers' and non-customers' distributions over estimated net household income: more than 50% of non-customers come from very-low-income households, versus only around 15% of customers. The most frequent category among customers is average income, the second most frequent lower income; still, the total percentage of customers with average or higher income exceeds 50%.
Now we can see a difference between the two older segments, clusters 0 and 6. Over 60% of cluster 6 households have either lower or very low income, while more than 70% of cluster 0 has average or higher income. Similarly, around 70% of cluster 4's households have average or higher income.
Well, that would be the case if this feature indicated the individual's income; since it indicates net household income, it doesn't say anything about the specific individuals in clusters 0 and 6. Since cluster 6 tends more toward singles, it makes sense that cluster 0's household income would be higher: if a cluster 0 individual is financially above average, it's safe to say the rest of their family probably is too, which makes their net household income larger than the same individual's would be if single, and that is exactly the situation of cluster 6.
compare_feature("WOHNLAGE", "Neighborhood Area", figsize=(12, 6))
plt.tight_layout()
plt.savefig("Neighborhood.png");
The most frequent neighborhood area for both customers and non-customers is average neighborhoods; however, the next most frequent for customers is rural neighborhoods, while for non-customers it is poor neighborhoods. We can also see that the percentage of customers occupying above-average neighborhood areas is larger than that of non-customers.
We can see that our remark about the household income difference between clusters 0 and 6 has been useful: cluster 6 has the highest percentage occupying average and above neighborhood areas, while the most frequent neighborhood area for cluster 0 is rural areas, since they are mostly families. Cluster 4 is extremely similar to cluster 0 in this attribute.
compare_feature("MOBI_REGIO", "Moving Patterns", figsize=(12, 4))
plt.tight_layout()
plt.savefig("Moving_Patterns.png");
50% of customers are classified as having low or very low mobility, while more than 60% of non-customers are the extreme opposite, classified as having either high or very high mobility.
Once again we can see some of the differentiating factors between clusters 0 and 6: since cluster 6 consists mostly of single individuals, more than 60% of them have high or very high mobility and 25% have middle mobility. On the other hand, since clusters 0 and 4 tend to be families, their mobility is much lower than cluster 6's, with almost 75% of cluster 0 and 65% of cluster 4 having low or very low mobility.
Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.
The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
We have already used the general population data for scaling, dimensionality reduction and clustering. We could use this pipeline to engineer new features, but right now the best move is to build the most basic model we can, which delivers the fastest results, and then think about how to improve it.
So we'll follow the main steps we followed before, to see whether this dataset has similar properties to the general population dataset and whether the cleaning pipeline will need any changes.
mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
# load imputer
imputer = pickle.load(open("imputer.pkl", "rb"))
# load scaler and pca
scaler = pickle.load(open("scaler.pkl", "rb"))
pca = pickle.load(open("pca.pkl", "rb"))
d = 224
# load kmeans model
kmeans = pickle.load(open("kmeans.pkl", "rb"))
# load clean_azdias
clean_azdias = pickle.load(open("clean_azdias.pkl", "rb"))
# Replace unknown values with null
mailout_train_new = replace_unknown_with_null(mailout_train)
# Check null percentages in columns and rows
fig = plt.figure(figsize=(14, 4))
plt.subplot(1, 2, 1)
plt.title("Null Percentages in Columns")
mailout_null_columns = get_null_prop(mailout_train_new, axis=0, plot=True)
plt.subplot(1, 2, 2)
plt.title("Null Percentages in Rows")
mailout_null_rows = get_null_prop(mailout_train_new, axis=1, plot=True)
plt.tight_layout()
plt.savefig("mailout_null_before.png")
# Test cleaning the dataset using cleaning function
mailout_train_clean = clean_dataset(mailout_train)
print("Shape before cleaning:", mailout_train.shape)
print("Shape after cleaning:", mailout_train_clean.shape)
# Check if feature set is the same as clean AZDIAS
set(mailout_train_clean.columns) == set(clean_azdias.columns)
There seem to be some differences in features in the cleaned MAILOUT data.
# Check for difference in features in both datasets
print("Features in clean MAILOUT not in clean AZDIAS:", set(mailout_train_clean.columns).difference(clean_azdias.columns))
print("Features in clean AZDIAS not in clean MAILOUT:", set(clean_azdias.columns).difference(mailout_train_clean.columns))
So there are some features in MAILOUT that weren't dropped, and if we opt to use them, there are some points that we need to take care of:
They could be concatenated with the results of the original pipeline after being cleaned separately, or we could just drop them.
mailout_train_clean = clean_dataset(mailout_train_new, keep_features=["KBA13_ANTG4"])
# Check null percentages in columns and rows
fig = plt.figure(figsize=(14, 4))
plt.subplot(1, 2, 1)
plt.title("Null Percentages in Columns")
mailout_null_columns = get_null_prop(mailout_train_clean, axis=0, plot=True)
plt.subplot(1, 2, 2)
plt.title("Null Percentages in Rows")
mailout_null_rows = get_null_prop(mailout_train_clean, axis=1, plot=True)
plt.tight_layout()
plt.savefig("mailout_null_after.png")
I'll fill these columns with an ad-hoc imputation just to build the baseline model. But first I need to take a look at these features, which weren't included in our previous analysis.
new_feats = ['D19_BUCH_CD', 'VHA', 'EXTSEL992', 'D19_SOZIALES',
'D19_VOLLSORTIMENT', 'RESPONSE', 'AGER_TYP', 'D19_SONSTIGE']
mailout_train_clean[new_feats].head()
mailout_train_clean[new_feats].hist(figsize=(14, 14));
plt.tight_layout()
plt.savefig("mailout_new_feats.png");
mailout_train_clean[new_feats].isna().sum()/mailout_train_clean.shape[0]
I have noticed that RESPONSE is extremely imbalanced, which means the baseline results will probably be really bad; the silver lining is that this leaves plenty of room for improvement.
Right now I'll impute the old features using the previously fitted imputer, and then impute the new features the way we did before, using the mean of each feature, since all of them have more than 10% missing values.
# Impute clean AZDIAS features using old imputer
mailout_train_clean.loc[:, clean_azdias.columns] = imputer.transform(mailout_train_clean[clean_azdias.columns])
# Make new imputer for remaining features
mailout_imputer = SimpleImputer(strategy="mean")
# Fit imputer and impute the remaining columns in place
mailout_train_clean.loc[:, :] = mailout_imputer.fit_transform(mailout_train_clean)
# Assert if all null values have been removed
assert not mailout_train_clean.isna().any().any()
Finally we should check for any null values in RESPONSE.
mailout_train_clean["RESPONSE"].isna().sum()
# Split training data into X and y
X = mailout_train_clean.loc[:, mailout_train_clean.columns != "RESPONSE"]
y = mailout_train_clean.loc[:, "RESPONSE"]
The model that I have in mind is RandomForestClassifier.
In order to evaluate this baseline, we need to have some sort of validation set in order to score our results.
For validation I'll use Stratified K-Fold cross-validation (to account for the RESPONSE imbalance), and I'll use sklearn's classification report, which shows recall, precision and F1-score for each label.
I'll also use the ROC AUC score, since it's the metric of the final competition.
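For reference, a minimal sketch of what such a validation helper could look like, using out-of-fold predictions from a stratified K-fold split. The function body, signature and defaults here are my assumptions, not the exact implementation used in this notebook:

```python
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def model_validation(model, X, y, n_splits=3, seed=42):
    """Score a model using stratified out-of-fold predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Hard out-of-fold predictions for per-class precision/recall/F1
    y_pred = cross_val_predict(model, X, y, cv=skf)
    print(classification_report(y, y_pred))
    # Macro ROC AUC computed on the same hard predictions
    auc = roc_auc_score(y, y_pred, average="macro")
    print("ROC AUC: {:.3f}".format(auc))
    return auc
```

Every row gets predicted exactly once by a model that never saw it during fitting, so the printed report reflects generalization rather than training fit.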
model_validation(RandomForestClassifier(random_state=seed), X, y)
So as you can see, the results are really bad; the silver lining is that there is plenty of room for improvement.
First, we know that the target is extremely imbalanced, so we need to address this by combining over-sampling or under-sampling techniques with the machine learning algorithm that we'll use.
Fortunately, this is provided by the imblearn package.
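For intuition, the core of random under-sampling is tiny: drop majority-class rows until the classes are balanced. A rough NumPy sketch of that idea (imblearn's samplers do this properly, with many more options; the function name here is mine):

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Randomly drop majority-class rows until all classes are equal in size."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Sample without replacement down to the minority-class size
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```

BalancedRandomForestClassifier applies this kind of under-sampling per bootstrap sample, so each tree trains on a balanced subset.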
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
We can see that this model still performs badly, but it's significantly better than the regular RandomForestClassifier.
The best thing about classification_report is that it gives us a variety of metrics that we can use to judge our model.
That depends on the margin of error we are willing to accept with our model. For example, we might want a model that catches all individuals likely to become customers, even at the cost of also flagging a large number of individuals who won't be. In that case we'd want a model with higher recall.
On the other hand, mailing out to a huge mass of people might be expensive and counterproductive business-wise in some cases. In that case we'd want a model with higher precision.
To get the best of both worlds, we can use the F1-score, which is the harmonic mean of recall and precision, penalizing a model that scores badly on either metric.
Personally I prefer the F1-score, specifically the macro-averaged F1-score, which averages the per-class F1-scores regardless of how many data points belong to each class; a weighted-average F1-score would barely account for the class we actually care about predicting, due to its extremely low count.
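To make the macro vs. weighted distinction concrete, here is a small worked example in plain Python on toy labels (the numbers are illustrative, not from our data):

```python
def f1_per_class(y_true, y_pred, label):
    """F1-score for one class: harmonic mean of its precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The harmonic mean collapses toward the smaller of the two values
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 90 negatives, 10 positives; the model finds only 2 of the positives
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2

f1_neg = f1_per_class(y_true, y_pred, 0)   # about 0.95: majority class is easy
f1_pos = f1_per_class(y_true, y_pred, 1)   # about 0.29: most positives missed
macro = (f1_neg + f1_pos) / 2              # punishes the weak minority class
weighted = 0.9 * f1_neg + 0.1 * f1_pos     # dominated by the majority class
```

Here the weighted score looks comfortable (around 0.88) while the macro score (around 0.62) exposes how badly the minority class is handled, which is exactly the behavior we want from the metric.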
It also looks like the macro ROC AUC is identical to macro recall, so we can use recall as a proxy for ROC AUC instead of calculating it twice.
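This identity is expected rather than a coincidence: when ROC AUC is computed from hard 0/1 predictions instead of probabilities, the ROC curve has a single interior point and the area reduces to (TPR + TNR) / 2, which for a binary problem is exactly macro-averaged recall. A quick sanity check in plain Python:

```python
def macro_recall(y_true, y_pred):
    """Unweighted average of per-class recalls for binary labels."""
    recalls = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label == p)
        n = sum(1 for t in y_true if t == label)
        recalls.append(tp / n)
    return sum(recalls) / 2

def auc_hard_labels(y_true, y_pred):
    """ROC AUC when the 'scores' are hard 0/1 predictions: (TPR + TNR) / 2."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    tpr = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / pos
    tnr = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / neg
    return (tpr + tnr) / 2

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [0] * 4 + [1] * 6
# macro_recall(y_true, y_pred) and auc_hard_labels(y_true, y_pred) coincide
```

Note this only holds for hard predictions; once we feed predicted probabilities into the AUC (as the Kaggle competition does), the two metrics diverge again.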
I think that getting a model with high precision on this dataset would be a stretch, so my best guess right now is to optimize recall while keeping an eye on precision, to avoid a model that is really bad at the latter.
# Filter for features used in clean_azdias
mailout_train_pre_pca = mailout_train_clean[clean_azdias.columns]
# Scale the features
mailout_train_pre_pca = scaler.transform(mailout_train_pre_pca)
# Reduce dimensions using pca to d dimensions
X = pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d])
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
The results of the PCA-transformed training set with BalancedRandomForestClassifier are worse than those of the vanilla dataset.
We can spot-check multiple algorithms to verify whether this is the case.
Instead of implementing under-sampling or over-sampling inside validation, we can use BalancedBaggingClassifier to wrap the classification algorithms that we are interested in using.
bagging_model = lambda model: BalancedBaggingClassifier(model)
models = {"RF": BalancedRandomForestClassifier(random_state=seed),
          "LR": bagging_model(LogisticRegression(max_iter=1000, random_state=seed)),
          "KNN": bagging_model(KNeighborsClassifier()),
          "Linear SVM": bagging_model(SVC(kernel="linear", random_state=seed)),
          "RBF SVM": bagging_model(SVC(kernel="rbf", random_state=seed))}
pca_bagging_scores = evaluate_models(models, X, y, "recall_macro")
We can see that bagging with LogisticRegression and RBF SVM exceeded the recall of BalancedRandomForestClassifier. However, we have to note that we didn't increase the number of estimators, and we still don't know the exact recall on the responsive class.
We'll need to scale the data here, because algorithms like Logistic Regression, KNN and SVM can perform better on scaled data.
# Split training data into X and y
X = mailout_train_clean.loc[:, mailout_train_clean.columns != "RESPONSE"]
y = mailout_train_clean.loc[:, "RESPONSE"]
scaled_pipeline = lambda model: make_pipeline(StandardScaler(), model)
models = {"RF": BalancedRandomForestClassifier(random_state=seed),
          "LR": scaled_pipeline(bagging_model(LogisticRegression(max_iter=1000, random_state=seed))),
          "KNN": scaled_pipeline(bagging_model(KNeighborsClassifier())),
          "Linear SVM": scaled_pipeline(bagging_model(SVC(kernel="linear", random_state=seed))),
          "RBF SVM": scaled_pipeline(bagging_model(SVC(kernel="rbf", random_state=seed)))}
original_bagging_scores = evaluate_models(models, X, y, "recall_macro")
By comparing the results, we can see that all of the algorithms (except bagged KNN) perform better on the original data. However, the improvement over BalancedRandomForestClassifier is tiny in every case, so we won't explore them further.
As for the significant improvement in BalancedRandomForestClassifier, this could be for two reasons:
# Make new dataframe with only new feats
mailout_train_new_feats = mailout_train_clean.loc[:, new_feats]
# Concatenate PCA transformed dataframe with new feats dataframe
X = pd.concat([pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d]), mailout_train_new_feats.reset_index(drop=True)], axis=1)
# Drop RESPONSE
X = X.drop(columns=["RESPONSE"])
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
We can see that these features have definitely improved upon our PCA results, even beating the original dataset's score.
# Concatenate PCA transformed dataframe with new feats dataframe
X = pd.concat([pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d]), mailout_train_clean.reset_index(drop=True)], axis=1)
# Drop RESPONSE
X = X.drop(columns=["RESPONSE"])
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
We can see that this has worsened the results. It seems the PCA-transformed features provide better information than the original features before transformation, and having both at the same time doesn't improve model performance at all.
# transform mailout train data using PCA
mailout_train_pca = pd.DataFrame(pca.transform(mailout_train_pre_pca)[:, :d],
columns=[f"pca_{i}" for i in range(d)])
# predict mailout cluster distances
mailout_distances = pd.DataFrame(kmeans.transform(mailout_train_pca),
columns=[f"cluster_{i}" for i in range(kmeans.cluster_centers_.shape[0])])
# predict mailout cluster
mailout_clusters = pd.Series(kmeans.predict(mailout_train_pca), name='label')
# visualize mailout clusters
mailout_clusters.value_counts().plot(kind="bar", title="How Are Clusters Distributed in MAILOUT Data?");
plt.tight_layout()
plt.savefig("mailout_clusters.png");
If we were to use these clusters directly, we would predict that the majority of the individuals in the mailout data would respond, but we know from the labels that only a very small portion of them actually responded.
# concatenate cluster labels to the clean dataset
X = pd.concat([mailout_train_pca, mailout_train_new_feats.reset_index(drop=True), mailout_distances], axis=1)
# drop RESPONSE
X = X.drop(columns=["RESPONSE"])
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
The results are worse than just using PCA features and new features.
Since the number of features here is very small, I think we could also test other models for this feature set, trying them with under-sampling.
# concatenate cluster labels to the clean dataset
X = pd.concat([mailout_train_new_feats.reset_index(drop=True), mailout_distances], axis=1)
# drop RESPONSE
X = X.drop(columns=["RESPONSE"])
model_validation(BalancedRandomForestClassifier(n_estimators=1000, random_state=seed), X, y)
We can see that the results have significantly improved, which indicates that the presence of all original features isn't useful for the model.
There are several feature-selection methods in scikit-learn, but I'll use SelectKBest, which removes all but the K highest-scoring features (scored with the ANOVA F-statistic by default).
# Concatenate all features
X = pd.concat([mailout_train_pca, mailout_train_clean.reset_index(drop=True), mailout_distances, mailout_clusters], axis=1)
# Drop RESPONSE
X = X.drop(columns=["RESPONSE"])
for k in [10, 30, 50, 100, 300, 500]:
    print("K:", k)
    # Make pipeline: feature selection followed by the classifier
    pipeline = make_pipeline(SelectKBest(k=k),
                             BalancedRandomForestClassifier(n_estimators=1000, random_state=seed))
    model_validation(pipeline, X, y)
    print()
pca_feats = list(mailout_train_pca.columns)
azdias_feats = list(clean_azdias.columns)
for k in [10, 30, 50]:
    print("K:", k)
    # Select the K best PCA/AZDIAS features; pass the rest through untouched
    pipeline = Pipeline([
        ('feature_selection', ColumnTransformer([
            ('kbest', SelectKBest(k=k), pca_feats + azdias_feats),
        ], remainder='passthrough')),
        ('clf', BalancedRandomForestClassifier(n_estimators=1000, random_state=seed))
    ])
    model_validation(pipeline, X, y)
    print()
Now I'll make a function to document all of the steps we did for preparing the dataset, so we'll be able to do the same in testing.
def prepare_mailout(df, p=0.5, test=False):
    """Prepare MAILOUT training and testing dataset for ML Pipeline."""
    # Set dropping threshold to 1.0 for test set
    if test:
        p = 1.0
    # Clean the dataset
    df_clean = clean_dataset(df, p_row=p, p_col=p, keep_features=["KBA13_ANTG4"])
    # Separate and drop RESPONSE if train set
    if test:
        y = None
    else:
        y = df_clean["RESPONSE"]
        df_clean.drop("RESPONSE", axis=1, inplace=True)
    # Filter features used in train set only
    if test:
        train_feats = pickle.load(open("mailout_train_feats.pkl", "rb"))
        df_clean = df_clean.loc[:, train_feats]
    else:
        train_feats = list(df_clean.columns)
        pickle.dump(train_feats, open("mailout_train_feats.pkl", "wb"))
    # Missing values
    # Impute clean AZDIAS features using old imputer
    azdias_imputer = pickle.load(open("imputer.pkl", "rb"))
    df_clean.loc[:, clean_azdias.columns] = azdias_imputer.transform(df_clean[clean_azdias.columns])
    # Impute remaining features
    if test:
        # Load mailout imputer for test set
        mailout_imputer = pickle.load(open("mailout_imputer.pkl", "rb"))
        df_clean = pd.DataFrame(mailout_imputer.transform(df_clean), columns=df_clean.columns)
    else:
        # Fit imputer on train set and pickle it to reuse with the test set
        mailout_imputer = SimpleImputer(strategy="mean")
        df_clean = pd.DataFrame(mailout_imputer.fit_transform(df_clean), columns=df_clean.columns)
        pickle.dump(mailout_imputer, open("mailout_imputer.pkl", "wb"))
    # PCA features
    df_pre_pca = df_clean[clean_azdias.columns]
    df_pre_pca_scaled = scaler.transform(df_pre_pca)
    df_pca = pd.DataFrame(pca.transform(df_pre_pca_scaled)[:, :d],
                          columns=[f"pca_{i}" for i in range(d)])
    # Cluster distances
    df_distances = pd.DataFrame(kmeans.transform(df_pca),
                                columns=[f"cluster_{i}" for i in range(kmeans.cluster_centers_.shape[0])])
    # Cluster labels
    df_clusters = pd.Series(kmeans.predict(df_pca), name='label')
    # Concatenate all features
    X = pd.concat([df_clean, df_pca, df_distances, df_clusters], axis=1)
    return X, y
# Prepare training set
X_train, y_train = prepare_mailout(mailout_train)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('k_best', SelectKBest()),
    ('clf', BalancedRandomForestClassifier(random_state=seed))
])
pipeline_grid = {
    "k_best__k": [int(x) for x in np.linspace(start=5, stop=30, num=5)],
    "clf__n_estimators": [int(x) for x in np.linspace(start=1000, stop=3000, num=5)],
    "clf__max_depth": [int(x) for x in np.linspace(10, 110, num=5)] + [None],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 4],
    "clf__bootstrap": [True, False],
}
combs = 1
for name, params in pipeline_grid.items():
    combs *= len(params)
print("Total number of combinations in parameters grid:", combs)
# # Instantiate StratifiedKFold object for CV
# skf = StratifiedKFold(n_splits=3)
# # Use RandomizedSearch to find the best hyperparameters combination
# pipeline_random = RandomizedSearchCV(estimator=pipeline, param_distributions=pipeline_grid,
# scoring='recall_macro', n_iter=50, cv=skf, verbose=3,
# random_state=seed, n_jobs=-1)
# # Fit the RandomizedSearch model to the training data
# pipeline_random.fit(X_train.values, y_train.values)
# pickle.dump(pipeline_random, open("pipeline_random.pkl", "wb"))
pipeline_random = pickle.load(open("pipeline_random.pkl", "rb"))
best_params = pipeline_random.best_params_
print("Best parameters found:", best_params)
pipeline.set_params(**best_params)
model_validation(pipeline, X_train, y_train, final_results=True, plot_confusion=True)
plt.tight_layout()
plt.savefig("confusion_matrix.png")
pipeline_recall = 0.7696068445499852
baseline_recall = 0.7085381814866175
percent_increase = (pipeline_recall - baseline_recall) * 100 / baseline_recall
print("The final pipeline improved Macro Recall results by {:.2f}%".format(percent_increase))
The results aren't state of the art, but we can see that we are able to predict 84% of the responsive individuals, while only 30% of the non-responsive individuals come up as false positives.
Comparing the results to the baseline model, the macro recall has improved by 8.62%.
Now the final thing we can do is tune the decision boundary to get the best results we can from this model. To do this, we first need to predict class probabilities.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
# Make empty y_pred to fill predictions of each fold
y_pred = np.zeros((y_train.shape[0], 2))
for i, (train_idx, test_idx) in enumerate(skf.split(X_train, y_train)):
    X_tr, X_test = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_tr, y_test = y_train.iloc[train_idx], y_train.iloc[test_idx]
    # Make a fresh copy of the model for this fold
    model_clone = clone(pipeline)
    # Fit model on the fold's training split
    model_clone.fit(X_tr, y_tr)
    # Predict the fold's held-out rows and store them in y_pred
    fold_pred = model_clone.predict_proba(X_test)
    y_pred[test_idx, :] += fold_pred
threshs = np.linspace(0.4, 0.6, 20)
f1_scores = []
recall_scores = []
precision_scores = []
for thresh in threshs:
    thresh_pred = (y_pred[:, 1] > thresh).astype(int)
    f1_scores.append(f1_score(y_train, thresh_pred, average="macro"))
    recall_scores.append(recall_score(y_train, thresh_pred, average="macro"))
    precision_scores.append(precision_score(y_train, thresh_pred, average="macro", zero_division=1))
best_recall_thresh = threshs[np.argmax(recall_scores)]
print("Threshold optimizing Recall: {:.3f}".format(best_recall_thresh))
plt.figure(figsize=(8, 4))
# plt.plot(threshs, f1_scores, label="F1");
plt.plot(threshs, recall_scores, label="Recall");
plt.plot(threshs, precision_scores, label="Precision");
plt.axvline(best_recall_thresh, linestyle='--', color='r')
plt.legend();
plt.title("Precision & Recall at Different Prediction Thresholds");
plt.tight_layout()
plt.savefig("thresholds.png");
best_thresh_pred = (y_pred[:, 1] > best_recall_thresh).astype(int)
ax = plt.subplot()
cmat = confusion_matrix(y_train, best_thresh_pred)
sns.heatmap(cmat, annot=True, fmt="g", ax=ax)
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['No Response', 'Response']); ax.yaxis.set_ticklabels(['No Response', 'Response']);
plt.tight_layout()
plt.savefig("confusion_matrix1.png")
We can see that the best threshold slightly improved the precision of the model, decreasing the number of false positives from 10327 to 10283.
For each step in the pipeline so far, there were surely other ways we could have achieved the same or even better results. But speaking from a business perspective, if this dataset represents the data of a previous campaign where the conversion rate was as minute as we've seen, using this model, which isn't particularly state of the art, would definitely decrease the costs and increase the conversion rate of the campaign thanks to the increased selectivity.
pipeline.set_params(**best_params)
pipeline.fit(X_train, y_train)
pickle.dump(pipeline, open("final_pipeline.pkl", "wb"))
Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!
Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.
mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
mailout_test.head()
mailout_test.shape
# Copying LNR column for Kaggle submission csv file
test_LNR = mailout_test['LNR']
X_test, _ = prepare_mailout(mailout_test, test=True)
# Check that all columns are the same between training and test set
assert set(X_train.columns) == set(X_test.columns)
test_preds = pipeline.predict_proba(X_test)
submission = pd.DataFrame({'LNR':test_LNR.astype(np.int32), 'RESPONSE':test_preds[:, 1]})
submission.head()
submission.to_csv('kaggle.csv', index=False)
To summarize what we have done in the project so far:
Cleaning the general population dataset was really challenging for me, as it was the first time I've ever dealt with a dataset of this size, and I didn't know where to begin exploring it.
This forced me to find methods to digest the data in smaller portions and get a general idea of how it should be handled, like splitting the features into categories and exploring them separately.
Also, I really enjoyed the customer segmentation part, even though it could be improved: we only tried reducing dimensionality with PCA and clustering with K-Means, whereas we could have used different clustering algorithms such as DBSCAN, Agglomerative Clustering or Gaussian Mixture Models.